Forthcoming,  Proceedings  of  the  31st  Conference  on  Uncertainty  in  Artificial  Intelligence,  2015. 


TECHNICAL  REPORT 
R-454 
July  2015 


Missing  Data  as  a  Causal  and  Probabilistic  Problem 


Ilya  Shpitser 

Mathematical  Sciences 
University  of  Southampton 
Southampton,  UK  SO  14  6WD 

i.shpitser@soton.ac.uk 


Karthika  Mohan 

Dept. of Computer Science
Univ.  of  California,  Los  Angeles 
Los  Angeles,  CA  90095 

karthika@cs.ucla.edu


Judea  Pearl 

Dept. of Computer Science
Univ.  of  California,  Los  Angeles 
Los  Angeles,  CA  90095 

judea@cs.ucla.edu


Abstract 

Causal  inference  is  often  phrased  as  a  missing 
data  problem  -  for  every  unit,  only  the  response 
to  observed  treatment  assignment  is  known,  the 
response  to  other  treatment  assignments  is  not. 

In this paper, we extend the converse approach of [7] of representing
missing data problems to causal models where only interventions on
missingness indicators are allowed. We further use this representation
to leverage techniques developed for the problem of identification of
causal effects to give a general criterion for cases where a joint
distribution containing missing variables can be recovered from data
actually observed, given assumptions on missingness mechanisms.

This criterion is significantly more general than the commonly used
"missing at random" (MAR) criterion, and generalizes past work which
also exploits a graphical representation of missingness. In fact, the
relationship of our criterion to MAR is not unlike the relationship
between the ID algorithm for identification of causal effects [22, 18],
and conditional ignorability [13].

1  INTRODUCTION 

Missing data is a ubiquitous problem in data analysis, and can arise
due to imperfect data collection, or various types of censoring, for
instance via loss to followup, or death. In addition, causal inference
can be viewed as a missing data problem, since the fundamental problem
of causal inference [4] is that for every unit only the response to
observed treatment assignment is known; the responses to other,
hypothetical treatment assignments are not known.

Handling missing data entails either dealing with a latent variable
model or finding plausible assumptions under which recoverability, that
is, unbiased inferences about all cases from the observed cases, is
possible. Well-known approaches of the former type include fitting a
latent variable model via gradient descent [17], the EM algorithm [1],
or performing Monte Carlo averaging via multiple imputation [16].
Well-known approaches of the latter type include the Kaplan-Meier
estimator in survival analysis [5], and adjustments based on Missing
Completely At Random (MCAR) and Missing At Random (MAR) assumptions
[15].

While methods based on inference in a latent variable model are more
generally applicable, they are also methodologically and
computationally challenging. At the same time, recoverability methods
based on MCAR and MAR rely on strong assumptions on how missingness
comes about. When neither MCAR nor MAR holds, data is said to be
Missing Not At Random (MNAR), and in this case a characterization of
recoverability is an open problem, although many sufficient conditions
for recoverability are known [7, 6].

In this paper, we take the converse view to "causality as missing
data," and view missing data as a particular type of partly causal, and
partly probabilistic inference problem [2, 7]. We then represent this
problem using partly causal, and partly probabilistic graphical models,
and exploit techniques developed for similar models in the context of
identification of causal effects to develop a general algorithm for
recoverability under MNAR. In fact, the relationship between our
algorithm and MAR is not unlike the relationship between the ID
algorithm for identification of causal effects [22, 18, 19], and the
conditional ignorability assumption in causal inference [13].

The paper is organized as follows. We introduce the notation and
concepts we will need in section 2. In section 3, we use missingness
graphs and missingness models to formally define missing data as a type
of causal inference problem where only interventions on certain
variables are allowed. We introduce recoverability and give examples of
where recoverability is possible in MNAR settings in section 4. We
introduce a general algorithm for recoverability we call MID in section
5, and show it is sound. Section 6 illustrates a complex case where the
entire recursive structure of MID is necessary. Section 7 discusses
non-recoverability, and section 8 contains our conclusions.

2  PRELIMINARIES

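Variables are capital letters, values are small letters. Variable sets
are bold capital letters, value sets are bold small letters. A state
space for a variable $A$ is $\mathfrak{X}_A$. A state space for a set of
variables $\mathbf{A}$ is the Cartesian product of the individual state
spaces: $\mathfrak{X}_{\mathbf{A}} = \times_{A \in \mathbf{A}} \mathfrak{X}_A$.
For a set of values $\mathbf{a}$, and $\mathbf{B} \subseteq \mathbf{A}$,
denote by $\mathbf{a}_{\mathbf{B}}$ a projection of $\mathbf{a}$ to
$\mathbf{B}$. Denote $\mathbf{a}_B$ as a shorthand for
$\mathbf{a}_{\{B\}}$. We will denote a vector of 0s as $\mathbf{0}$.
$\mathbf{0}_{\mathbf{B}}$ means "a set of 0 values assigned to $\mathbf{B}$."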

2.1  GRAPH  THEORY  AND  NOTATION 

A directed graph consists of a set of nodes and directed arrows
($\to$) connecting pairs of nodes. A mixed graph consists of a set of
nodes and directed and/or bidirected arrows ($\leftrightarrow$)
connecting pairs of nodes. A path is a sequence of distinct edges where
any edge in the sequence that ends in a node $A$ implies the subsequent
edge must start with $A$, and each such node $A$ may occur at most once
in this way in the sequence. A directed path from a node $X$ to a node
$Y$ is a path consisting of directed edges where all edges on the path
point away from $X$ and towards $Y$.

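If the edge $X \to Y$ exists in a graph $\mathcal{G}$, we say $X$ is a
parent of $Y$ and $Y$ is a child of $X$. If a directed path from $X$ to
$Y$ exists in $\mathcal{G}$, we say $X$ is an ancestor of $Y$, and $Y$
is a descendant of $X$. We denote by $\mathrm{pa}_{\mathcal{G}}(A)$,
$\mathrm{ch}_{\mathcal{G}}(A)$, $\mathrm{de}_{\mathcal{G}}(A)$,
$\mathrm{an}_{\mathcal{G}}(A)$, $\mathrm{nd}_{\mathcal{G}}(A)$ the sets
of parents, children, descendants, ancestors, and non-descendants of
$A$ in $\mathcal{G}$, respectively. These are defined disjunctively for
sets, e.g. $\mathrm{pa}_{\mathcal{G}}(\mathbf{A}) = \bigcup_{A \in \mathbf{A}} \mathrm{pa}_{\mathcal{G}}(A)$.
Let $\mathrm{fa}_{\mathcal{G}}(A) = \mathrm{pa}_{\mathcal{G}}(A) \cup \{A\}$,
$\mathrm{pa}^s_{\mathcal{G}}(\mathbf{A}) = \mathrm{pa}_{\mathcal{G}}(\mathbf{A}) \setminus \mathbf{A}$,
$\mathrm{ndp}_{\mathcal{G}}(A) = \mathrm{nd}_{\mathcal{G}}(A) \setminus \mathrm{pa}_{\mathcal{G}}(A)$.
Given a graph $\mathcal{G}$, we say a vertex set $\mathbf{A}$ is
ancestral if $\mathrm{an}_{\mathcal{G}}(\mathbf{A}) = \mathbf{A}$. By
convention, in any directed graph,
$A \in \mathrm{an}_{\mathcal{G}}(A) \cap \mathrm{de}_{\mathcal{G}}(A)$.
A directed graph is said to have a directed cycle if there are
$X \neq Y$ such that
$X \in \mathrm{an}_{\mathcal{G}}(Y) \cap \mathrm{de}_{\mathcal{G}}(Y)$.
A directed graph without such cycles is called a directed acyclic graph
(DAG).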
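The graph relations above can be sketched in a few lines of Python. This is only an illustration under assumptions of our own: a graph is represented as a set of `(parent, child)` pairs, and the function names and the four-node example graph are hypothetical rather than part of the paper.

```python
def children(edges, node):
    """Children of `node`; `edges` is a set of (parent, child) pairs."""
    return {c for p, c in edges if p == node}


def parents(edges, node):
    """Parents of `node`."""
    return {p for p, c in edges if c == node}


def descendants(edges, node):
    """de(node); contains `node` itself, following the convention in the text."""
    seen, stack = {node}, [node]
    while stack:
        current = stack.pop()
        for child in children(edges, current):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def ancestors(edges, node):
    """an(node), computed as descendants in the edge-reversed graph."""
    return descendants({(c, p) for p, c in edges}, node)


def is_ancestral(edges, node_set):
    """A set is ancestral if it equals the union of its members' ancestors."""
    closure = set()
    for node in node_set:
        closure |= ancestors(edges, node)
    return closure == set(node_set)


def has_directed_cycle(edges):
    """True if some X != Y is both an ancestor and a descendant of Y."""
    nodes = {v for edge in edges for v in edge}
    return any(
        x != y and x in ancestors(edges, y) and x in descendants(edges, y)
        for x in nodes
        for y in nodes
    )


# Hypothetical four-node graph used only for illustration.
example_edges = {("C", "X"), ("C", "R_X"), ("X", "S_X"), ("R_X", "S_X")}
```

The quadratic cycle check mirrors the definition directly; for large graphs a depth-first search would be the usual choice.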

A conditional DAG (CDAG) $\mathcal{G}(\mathbf{V} \mid \mathbf{W})$ is a
DAG with vertices $\mathbf{V} \cup \mathbf{W}$ with the property that
$\mathrm{pa}_{\mathcal{G}}(\mathbf{W}) = \emptyset$. We will denote
vertices in $\mathbf{V}$ as circles, and vertices in $\mathbf{W}$ as
squares. Note that we do not require that all $V \in \mathbf{V}$ must
have parents. We simply distinguish certain parentless nodes in
$\mathcal{G}$ as $\mathbf{W}$. We will interpret vertices in
$\mathbf{V}$ as associated with random variables and vertices in
$\mathbf{W}$ as associated with variables that have been "set to a
constant" in some way. One example of a CDAG is a mutilated graph that
arises in the analysis of interventional distributions. When
considering d-separation on vertices in $\mathbf{V}$ in a CDAG [9], we
will treat it as ordinary d-separation in a DAG, except all nodes in
$\mathbf{W}$ are implicitly conditioned on.

If vertices not in $\mathbf{W}$ in a CDAG correspond to a variable
partition into observed and missing variables, we will explicitly
denote the set of vertices corresponding to missing variables as
$\mathbf{M}$, and the other vertices as $\mathbf{O}$, like so:
$\mathcal{G}(\mathbf{O}, \mathbf{M} \mid \mathbf{W})$. A CDAG where
$\mathbf{W}$ is empty is written as $\mathcal{G}(\mathbf{V})$ or
$\mathcal{G}(\mathbf{O}, \mathbf{M})$ as a shorthand.

A conditional acyclic directed mixed graph (CADMG)
$\mathcal{G}(\mathbf{V} \mid \mathbf{W})$ is a mixed graph with two
types of edges, $\to$ and $\leftrightarrow$, and with no directed
cycles, where no arrowhead may point to an element of $\mathbf{W}$. We
will sometimes omit variables from CDAGs and CADMGs if they are obvious
to avoid notation clutter, e.g. we will write
$\mathcal{G}(\mathbf{V} \mid \mathbf{W})$ simply as $\mathcal{G}$. Given
a CDAG $\mathcal{G}(\mathbf{O}, \mathbf{M} \mid \mathbf{W})$, define
$\mathcal{G}_{\underline{\mathbf{B}}}$ to be an edge subgraph obtained
from $\mathcal{G}$ by removing all arrows pointing away from
$\mathbf{B}$.

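Define a latent projection of
$\mathcal{G}(\mathbf{O}, \mathbf{M} \mid \mathbf{W})$ onto
$\mathbf{O} \cup \mathbf{W}$ [23] to be a CADMG
$\mathcal{G}^{(\mathbf{O})}(\mathbf{O}, \mathbf{M} \mid \mathbf{W}) \equiv \mathcal{G}^{*}(\mathbf{O} \mid \mathbf{W})$
such that for any $V_1, V_2 \in \mathbf{O} \cup \mathbf{W}$:

•  There is an edge $V_1 \to V_2$ if and only if there is a directed
path $V_1 \to \ldots \to V_2$ in
$\mathcal{G}(\mathbf{O}, \mathbf{M} \mid \mathbf{W})$ with all
intermediate nodes in $\mathbf{M}$.

•  There is an edge $V_1 \leftrightarrow V_2$ if and only if there is a
marginally d-connected path $V_1 \leftarrow \ldots \to V_2$ in
$\mathcal{G}(\mathbf{O}, \mathbf{M} \mid \mathbf{W})$ with all
intermediate nodes in $\mathbf{M}$.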

Latent projections are a simplified representation of an infinitely
large class of hidden variable CDAGs with structural features in
common. In this paper, we use them only to simplify the statements and
proofs of our results. The results themselves will always be about
models represented by DAGs (and CDAGs).
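The latent projection can be computed directly from its two defining conditions. The sketch below is illustrative only: it reads the bidirected-edge condition as "some hidden node has hidden-only directed paths to both endpoints", and the example graph `mnar_edges` is a hypothetical missingness graph of our own choosing, not a transcription of any figure in the paper.

```python
def reach_via_hidden(edges, start, hidden):
    """Nodes reachable from `start` by a directed path whose intermediate
    nodes all lie in `hidden`."""
    reached, visited, stack = set(), {start}, [start]
    while stack:
        current = stack.pop()
        for parent, child in edges:
            if parent != current:
                continue
            reached.add(child)
            # Only hidden nodes may serve as intermediate nodes.
            if child in hidden and child not in visited:
                visited.add(child)
                stack.append(child)
    return reached


def latent_projection(edges, hidden):
    """Return (directed, bidirected) edge sets over the non-hidden nodes."""
    hidden = set(hidden)
    kept = {v for edge in edges for v in edge} - hidden
    directed = {
        (a, b)
        for a in kept
        for b in reach_via_hidden(edges, a, hidden)
        if b in kept
    }
    bidirected = set()
    for h in hidden:
        ends = sorted(b for b in reach_via_hidden(edges, h, hidden) if b in kept)
        # Every pair of kept nodes reached from the same hidden source
        # is joined by a bidirected edge (stored as a sorted pair).
        for i, a in enumerate(ends):
            for b in ends[i + 1:]:
                bidirected.add((a, b))
    return directed, bidirected


# Hypothetical missingness graph with missing variables X and W.
mnar_edges = {
    ("W", "R_X"), ("W", "S_W"), ("R_W", "S_W"),
    ("X", "R_W"), ("X", "S_X"), ("R_X", "S_X"), ("R_W", "R_X"),
}
directed, bidirected = latent_projection(mnar_edges, {"X", "W"})
```

In this example the hidden variables `W` and `X` each induce one bidirected edge between their two observed children.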

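Given a CDAG $\mathcal{G}(\mathbf{V} \mid \mathbf{W})$, and
$\mathbf{A} \subseteq \mathbf{V} \cup \mathbf{W}$, define
$\mathcal{G}_{\mathbf{A}}(\mathbf{V} \mid \mathbf{W}) = \mathcal{G}(\mathbf{V} \cap \mathbf{A} \mid \mathbf{W} \cap \mathbf{A})$
to be the subgraph of $\mathcal{G}$ containing the vertex set
$\mathbf{A}$ and any edge in $\mathcal{G}$ between elements in
$\mathbf{A}$.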

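Given a CADMG $\mathcal{G}(\mathbf{V} \mid \mathbf{W})$, and
$V \in \mathbf{V}$, define the district (or c-component [22, 18]) of
$V$ in $\mathcal{G}(\mathbf{V} \mid \mathbf{W})$ to be
$\mathrm{dis}_{\mathcal{G}}(V) = \{A \in \mathbf{V} \mid A \leftrightarrow \ldots \leftrightarrow V\}$.
The set of districts of $\mathcal{G}(\mathbf{V} \mid \mathbf{W})$ is
denoted by $\mathcal{D}(\mathcal{G}(\mathbf{V} \mid \mathbf{W}))$, and
it partitions $\mathbf{V}$.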
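Districts are the connected components of the bidirected part of a mixed graph, so they can be found with an ordinary graph search. The following is a minimal sketch; the function name and the input representation (a node collection plus a set of unordered bidirected pairs) are assumptions made for illustration.

```python
def districts(nodes, bidirected):
    """Partition `nodes` into components connected by bidirected edges."""
    neighbours = {v: set() for v in nodes}
    for a, b in bidirected:
        neighbours[a].add(b)
        neighbours[b].add(a)
    remaining, result = set(nodes), set()
    while remaining:
        start = remaining.pop()
        component, stack = {start}, [start]
        while stack:
            current = stack.pop()
            for other in neighbours[current]:
                if other not in component:
                    component.add(other)
                    stack.append(other)
        remaining -= component
        result.add(frozenset(component))
    return result


# Hypothetical inputs: four observed nodes and two bidirected edges.
example_districts = districts(
    {"R_X", "R_W", "S_X", "S_W"},
    {("R_X", "S_W"), ("R_W", "S_X")},
)
```

A node with no bidirected edges forms a district on its own, which is why the result always partitions the node set.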

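For any $V \in \mathbf{O}$ in a CDAG
$\mathcal{G}(\mathbf{O}, \mathbf{M} \mid \mathbf{W})$ where for every
$M \in \mathbf{M}$,
$\mathrm{de}_{\mathcal{G}}(M) \cap \mathbf{O} \neq \emptyset$, define
the clan of $V$ as
$\mathrm{cla}_{\mathcal{G}}(V) = \mathrm{an}_{\mathcal{G}_{\mathbf{D}_V \cup \mathbf{M}}}(\mathbf{D}_V)$,
where $\mathbf{D}_V = \mathrm{dis}_{\mathcal{G}^{(\mathbf{O})}}(V)$.
For example, in $\mathcal{G}$ shown in Fig. 1 (c), where $\{X, W\}$ are
missing,
$\mathrm{cla}_{\mathcal{G}}(R_X) = \mathrm{cla}_{\mathcal{G}}(S_W) = \{W, R_X, S_W\}$,
and
$\mathrm{cla}_{\mathcal{G}}(R_W) = \mathrm{cla}_{\mathcal{G}}(S_X) = \{X, R_W, S_X\}$.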

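For any $\mathbf{D} \in \mathcal{D}(\mathcal{G}^{(\mathbf{O})}(\mathbf{O}, \mathbf{M} \mid \mathbf{W}))$,
and $D_1, D_2 \in \mathbf{D}$,
$\mathrm{cla}_{\mathcal{G}}(D_1) = \mathrm{cla}_{\mathcal{G}}(D_2)$.
Thus we will write
$\mathrm{cla}_{\mathcal{G}}(\mathbf{D}) = \mathrm{cla}_{\mathcal{G}}(D)$,
for any $D \in \mathbf{D}$. In fact, the set of clans partitions
$\mathbf{O} \cup \mathbf{M}$ in $\mathcal{G}$ with the property above.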
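Given a district, its clan is the set of ancestors of the district inside the subgraph induced by the district and the missing variables. The sketch below assumes the district has already been computed; the function name and the example graph are hypothetical, chosen so that the resulting clans have the same shape as those quoted for Fig. 1 (c).

```python
def clan(edges, district, missing):
    """cla(D): ancestors of D in the subgraph induced by D and the missing set."""
    allowed = set(district) | set(missing)
    induced = {(p, c) for p, c in edges if p in allowed and c in allowed}
    result, stack = set(district), list(district)
    while stack:
        current = stack.pop()
        for parent, child in induced:
            if child == current and parent not in result:
                result.add(parent)
                stack.append(parent)
    return result


# Hypothetical missingness graph with missing variables X and W.
clan_edges = {
    ("W", "R_X"), ("W", "S_W"), ("R_W", "S_W"),
    ("X", "R_W"), ("X", "S_X"), ("R_X", "S_X"), ("R_W", "R_X"),
}
clan_of_rx = clan(clan_edges, {"R_X", "S_W"}, {"X", "W"})
clan_of_rw = clan(clan_edges, {"R_W", "S_X"}, {"X", "W"})
```

Edges that enter the district from observed nodes outside it, such as `R_W -> R_X` here, are dropped by the induced subgraph and so do not enlarge the clan.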

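Given a CDAG $\mathcal{G}$, a total ordering $\prec$ on vertices in
$\mathcal{G}$ is topological given $\mathcal{G}$ if $A \prec B$ implies
$A \notin \mathrm{de}_{\mathcal{G}}(B)$. Given an ordering $\prec$
topological given $\mathcal{G}$, define for any vertex $V$ in
$\mathcal{G}$,
$\mathrm{pre}_{\mathcal{G},\prec}(V) = \{W \neq V \mid W \prec V\}$.
Given $\prec$ topological for $\mathcal{G}$ with a vertex set
$\mathbf{V}$, if there is a subgraph $\mathcal{G}'$ of $\mathcal{G}$
with a vertex set $\mathbf{V}' \subseteq \mathbf{V}$, we will view
$\prec$ with respect to $\mathcal{G}'$ as the natural subordering
restricted to $\mathbf{V}'$. Note that this subordering will also be
topological with respect to $\mathcal{G}'$.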

A counterfactual (potential outcome) $Y(\mathbf{a})$ [8, 14] is a
response $Y$ to a hypothetical assignment of a set of treatments
$\mathbf{A}$ to values $\mathbf{a}$. Given a set of potential outcomes
$Y_1(\mathbf{a}), \ldots, Y_k(\mathbf{a})$, where
$\mathbf{Y} = \{Y_1, \ldots, Y_k\}$, we may consider a joint
distribution

$$p(\{Y_1, \ldots, Y_k\}(\mathbf{a})) = p(\mathbf{Y}(\mathbf{a})) = p(\mathbf{Y} \mid do(\mathbf{a})).$$

The $do(\cdot)$ notation is discussed extensively in [10].

3  MISSING  GRAPHS  AND 
MISSINGNESS  MODELS 

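Given a CDAG $\mathcal{G}(\mathbf{V} \mid \mathbf{W})$, we say
$p_{\mathbf{w}}(\mathbf{V})$ (a mapping from $\mathfrak{X}_{\mathbf{W}}$
to $p(\mathbf{V})$) is Markov relative to $\mathcal{G}$ if

$$p_{\mathbf{w}}(\mathbf{V}) = \prod_{V \in \mathbf{V}} p_{\mathbf{w}}(V \mid \mathrm{pa}_{\mathcal{G}}(V) \setminus \mathbf{W}), \quad (1)$$

and each term
$p_{\mathbf{w}}(V \mid \mathrm{pa}_{\mathcal{G}}(V) \setminus \mathbf{W})$
only depends on $\mathbf{w}_{\mathbf{W} \cap \mathrm{pa}_{\mathcal{G}}(V)}$.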

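Definition 1 (missingness graph)  Given a DAG
$\mathcal{G}(\mathbf{O}, \mathbf{M})$, a DAG $\mathcal{G}^m$ is called a
missingness graph for $\mathcal{G}$ if $\mathcal{G}^m$ has the vertex
set $\mathbf{O} \cup \mathbf{M} \cup \mathbf{R}_{\mathbf{M}} \cup \mathbf{S}_{\mathbf{M}}$,
where $\mathbf{R}_{\mathbf{M}} = \{R_M \mid M \in \mathbf{M}\}$,
$\mathbf{S}_{\mathbf{M}} = \{S_M \mid M \in \mathbf{M}\}$,
$\mathcal{G} = \mathcal{G}^m_{\mathbf{O} \cup \mathbf{M}}$, and for all
$M$ in $\mathbf{M}$, $\mathrm{pa}_{\mathcal{G}^m}(S_M) = \{M, R_M\}$,
$\mathrm{ch}_{\mathcal{G}^m}(S_M) = \emptyset$, and
$\mathrm{ch}_{\mathcal{G}^m}(R_M) \cap (\mathbf{O} \cup \mathbf{M}) = \emptyset$.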

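By convention, if $\mathbf{M} = \emptyset$, then
$\mathbf{S}_{\emptyset} = \mathbf{R}_{\emptyset} = \emptyset$. We will
refer to $\mathbf{O} \cup \mathbf{R}_{\mathbf{M}} \cup \mathbf{S}_{\mathbf{M}}$
as $\mathbf{V}$, and to $\mathbf{V} \cup \mathbf{M}$ as $\mathbf{A}$. We
call elements of $\mathbf{R}_{\mathbf{M}}$ indicators, and elements of
$\mathbf{S}_{\mathbf{M}}$ proxies.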

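Define $\mathcal{M}(\mathcal{G}^m(\mathbf{A}))$ to be the missingness
model for a missingness graph $\mathcal{G}^m(\mathbf{A})$ as a set of
distributions $\{p(\mathbb{A})\}$ over the following set of
counterfactuals
$\mathbb{A} = \{\mathbf{A}(\mathbf{r}) \mid \mathbf{R} \subseteq \mathbf{R}_{\mathbf{M}}, \mathbf{r} \in \mathfrak{X}_{\mathbf{R}}\}$,
such that $(\forall M \in \mathbf{M})$ $\mathfrak{X}_{R_M} = \{0, 1\}$,
$\mathfrak{X}_{S_M} = \mathfrak{X}_M \cup \{\text{missing}\}$, and the
missingness mechanism that determines the value of $S_M$ is as follows:
$S_M(0_{R_M}) = M$ and $S_M(1_{R_M}) = \text{missing}$. In addition:
$(\forall \mathbf{R} \subseteq \mathbf{R}_{\mathbf{M}}, \mathbf{r} \in \mathfrak{X}_{\mathbf{R}}, V \in \mathbf{A})$,

$$V(\mathbf{r}) \perp\!\!\!\perp \{\mathrm{ndp}_{\mathcal{G}^m}(V)\}(\mathbf{r}) \mid \{\mathrm{pa}_{\mathcal{G}^m}(V)\}(\mathbf{r}). \quad (2)$$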

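To obtain the set $\mathbb{A}$, we first define

$$\{\mathbf{A}(\mathbf{r}) \mid \mathbf{r} \in \mathfrak{X}_{\mathbf{R}_{\mathbf{M}}}\} = \{\mathbf{S}_{\mathbf{M}}(\mathbf{r}), \mathbf{O}, \mathbf{M} \mid \mathbf{r} \in \mathfrak{X}_{\mathbf{R}_{\mathbf{M}}}\},$$

and obtain the others via modified recursive substitution as in
definition 43 in [11], pp. 100-101.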
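The missingness mechanism for proxies is deterministic: the proxy reveals the underlying value when the indicator is 0 and takes the special value "missing" when it is 1. A minimal sketch follows; the `"R_"` and `"S_"` key prefixes, the `MISSING` sentinel, and the helper names are naming assumptions made for this illustration.

```python
MISSING = "missing"  # sentinel standing in for the extra state of a proxy


def proxy(m_value, r_value):
    """S_M(r): equals M when R_M = 0 and 'missing' when R_M = 1."""
    if r_value not in (0, 1):
        raise ValueError("an indicator must take the value 0 or 1")
    return m_value if r_value == 0 else MISSING


def manifest_row(full_row, missing_vars):
    """Replace each missing variable M in a full-data row by its proxy S_M.

    `full_row` maps variable names to values and must contain both M and
    its indicator under the key 'R_' + M.
    """
    observed = {k: v for k, v in full_row.items() if k not in missing_vars}
    for m in missing_vars:
        observed["S_" + m] = proxy(full_row[m], full_row["R_" + m])
    return observed
```

For instance, `manifest_row({"C": 1, "X": 5, "R_X": 1}, ["X"])` hides the value of `X`, while the same row with `R_X` set to 0 exposes it through `S_X`.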

A missingness model is thus really a particular type of a graphical
causal model where we only define interventions on a subset of
variables [11]. In particular, we allow
$\mathcal{G}(\mathbf{O}, \mathbf{M})$ to represent an ordinary hidden
variable statistical model. (2) is just the DAG local Markov property
linking $p(\mathbf{A}(\mathbf{r}))$ and
$\mathcal{G}_{\underline{\mathbf{r}}}$ for every $\mathbf{r}$. If we had
chosen to split variables in $\mathbf{R}$ into random and intervened
versions, and display both explicitly in the graph rather than only
displaying the random version of variables, and keeping intervened
versions implicit, as we do in $\mathcal{G}_{\underline{\mathbf{r}}}$,
we would end up with Single World Intervention Graphs (SWIGs), and the
appropriate local Markov property for those graphs, as discussed in
[11].

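Standard results on DAG models imply (2) is equivalent to (1) for
$p(\mathbf{A}(\mathbf{r}))$ and $\mathcal{G}_{\underline{\mathbf{r}}}$
(if we let $\mathbf{W} = \emptyset$, and keep fixed versions of
$\mathbf{R}$ implicit in the graph). We may also let
$\mathbf{W} = \mathbf{R}$, and treat $\mathbf{R}$ as a split node as in
a SWIG.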

4  RECOVERABILITY 

We call $p(\mathbf{V})$ the manifest distribution. A functional of
$p(\mathbb{A})$, $f(p(\mathbb{A}))$, is said to be recoverable given
$p(\mathbf{V})$ in $\mathcal{G}^m$ if there is a functional $g$ of
$p(\mathbf{V})$, such that $f(p(\mathbb{A})) = g(p(\mathbf{V}))$ for
every element of $\mathcal{M}(\mathcal{G}^m)$. In this paper, we will
concentrate on recoverability of $p(\mathbf{O} \cup \mathbf{M})$,
although many other kinds of recoverability problems are also
interesting, for instance recovering the causal effect in a causal
model with missingness.

We explicitly represent missingness as a causal inference problem
because this allows us to rephrase recoverability as identifiability of
causal effects. If we were allowed to assign $\mathbf{R}_{\mathbf{M}}$
without affecting other variables, we could use proxies
$\mathbf{S}_{\mathbf{M}}$ to recover the behavior of the underlying
missing variables $\mathbf{M}$, due to the following result.

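Lemma 1  In a DAG $\mathcal{G}^m$ where $\mathbf{M} \neq \emptyset$, for
any $p(\mathbb{A}) \in \mathcal{M}(\mathcal{G}^m(\mathbf{V}, \mathbf{M}))$,
and $R_M \in \mathbf{R}_{\mathbf{M}}$,
$p(\mathbb{Y}) \in \mathcal{M}(\mathcal{G}^m(\mathbf{V} \cup \{M\}, \mathbf{M} \setminus \{M\})_{\mathbf{V} \cup \mathbf{M} \setminus \{S_M, R_M\}})$,
where $\mathbb{Y}$ is

$$\{\{\mathbf{V} \cup \mathbf{M} \setminus \{R_M\}\}(\mathbf{r}, 0_{R_M}) \mid \mathbf{R} \subseteq \mathbf{R}_{\mathbf{M} \setminus \{M\}}, \mathbf{r} \in \mathfrak{X}_{\mathbf{R}}\}.$$

Proof: $\{\mathbf{V} \cup \mathbf{M}\}(\mathbf{r}, 0_{R_M})$ obeys (2)
for $\mathcal{G}_{\underline{\mathbf{r} \cup \{R_M\}}}$. Since
$\mathbf{A} \setminus \{R_M\}$ is ancestral in
$\mathcal{G}_{\underline{\mathbf{r} \cup \{R_M\}}}$,
$\{\mathbf{V} \cup \mathbf{M} \setminus \{R_M\}\}(\mathbf{r}, 0_{R_M})$
obeys (2) for
$(\mathcal{G}_{\underline{\mathbf{r} \cup \{R_M\}}})_{\mathbf{A} \setminus \{R_M\}}$.
Our conclusion follows since $M = S_M(0_{R_M})$. □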

In other words, fixing $R_M$ to 0 gives a new model where $M$ is
effectively observed since $M = S_M(0_{R_M})$. This implies that if we
were able to fix all of $\mathbf{R}_{\mathbf{M}}$, we could recover
$p(\mathbf{O} \cup \mathbf{M})$.

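Corollary 1
$p(\{\mathbf{O}, \mathbf{S}_{\mathbf{M}}\}(\mathbf{0}_{\mathbf{R}_{\mathbf{M}}})) = p(\mathbf{O} \cup \mathbf{M})$
for any $\mathcal{G}^m$, and any
$p(\mathbb{A}) \in \mathcal{M}(\mathcal{G}^m)$.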

This corollary implies that our recoverability problem is solved by
expressing a particular interventional distribution as a function of
the manifest in a restricted causal model. We will attack this problem
via two standard results for causal models that hold in restricted
causal models as well, as shown in [11], propositions 45 and 46.

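Theorem 1  For any
$p(\mathbb{A}) \in \mathcal{M}(\mathcal{G}^m(\mathbf{V}, \mathbf{M}))$,
and $(\forall \mathbf{R} \subseteq \mathbf{R}_{\mathbf{M}}, \mathbf{r} \in \mathfrak{X}_{\mathbf{R}})$,

$$p(\mathbf{A}(\mathbf{r})) = \prod_{V \in \mathbf{A}} p(V \mid \mathrm{pa}_{\mathcal{G}^m}(V) \setminus \mathbf{R}, \mathbf{r}_{\mathrm{pa}_{\mathcal{G}^m}(V) \cap \mathbf{R}}). \quad (3)$$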

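Theorem 2  For any
$p(\mathbb{A}) \in \mathcal{M}(\mathcal{G}^m(\mathbf{V}, \mathbf{M}))$,
and $(\forall \mathbf{R} \subseteq \mathbf{R}_{\mathbf{M}}, \mathbf{r} \in \mathfrak{X}_{\mathbf{R}})$,

$$p(\{(\mathbf{V} \cup \mathbf{M}) \setminus \mathbf{R}\}(\mathbf{r}) \mid \mathbf{r}) = p((\mathbf{V} \cup \mathbf{M}) \setminus \mathbf{R} \mid \mathbf{r}). \quad (4)$$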

(3) is known as the truncated factorization [10], manipulated
distribution [21], or the g-formula [12]. (4) is known as the
consistency property.

We now illustrate how constraints of the missingness model encoded by
$\mathcal{G}^m$, as well as (3) and (4), lead to recoverability.


Figure 1: (a) A missingness model satisfying the missing completely at
random (MCAR) assumption. (b) A missingness model satisfying the
missing at random (MAR) assumption. (c) A missingness model where
missingness is not at random (MNAR), but where recoverability is
nevertheless possible.

4.1  EXAMPLES  OF  RECOVERABILITY

Consider Fig. 1, where $X$, $C$, $W$ may possibly be high-dimensional.
In Fig. 1 (a), $X$ is missing according to a mechanism governed by an
independent indicator $R_X$, so

$$p(X) = p(S_X(0_{R_X})) = p(S_X \mid R_X = 0).$$

The assumption present in this model which allows us to recover the
underlying missing variable, namely
$(S_X(0_{R_X}) \perp\!\!\!\perp R_X)$, is known as the missing
completely at random (MCAR) assumption.$^1$ This assumption is the
missingness analogue of ignorability (lack of confounding between the
missingness indicator $R_X$ and the proxy $S_X(r)$ under assignment $r$
to $R_X$).

$^1$ $\perp\!\!\!\perp$ is the independence symbol.

In Fig. 1 (b), $X$ is missing according to a mechanism governed by an
indicator $R_X$ which has a (statistical) dependence on $X$ through
$C$, which is a fully observed variable. In this case,

$$p(X, C) = p(S_X(0_{R_X}) \mid C)\, p(C) = p(S_X \mid R_X = 0, C)\, p(C).$$

The assumption present in this model which allows us to recover the
underlying missing variable, namely
$(S_X(0_{R_X}) \perp\!\!\!\perp R_X \mid C)$, is known as the missing
at random (MAR) assumption. This assumption is the missingness analogue
of conditional ignorability (lack of confounding between the indicator
$R_X$ and the proxy $S_X(r)$ under assignment $r$ to $R_X$ given that
we conditioned on a set of variables $C$).

In Fig. 1 (c), it is not the case that

$$\{S_W(0_{R_W}), S_X(0_{R_X})\} \perp\!\!\!\perp \{R_X, R_W\}.$$

That is, data on $X$, $W$ is not missing completely at random (nor at
random, since there is no fully observed variable to screen off the
dependence of proxies under indicator assignment from indicators).
Nevertheless, despite the fact that data on $p(X, W)$ is missing not at
random (MNAR), we now show that $p(X, W)$ is recoverable. We will
exploit the fact that the missingness model implies

$$\{S_W(0_{R_W}), R_X(0_{R_W})\} \perp\!\!\!\perp \{S_X(0_{R_X}), R_W\}. \quad (5)$$

It is not difficult to show that $p(R_W, R_X, S_W, S_X)$ is equal to

$$p(\{S_W, R_X\}(R_W)) \cdot p(S_X(R_X), R_W) = \big(p(S_W \mid R_X, R_W)\, p(R_X \mid R_W)\big) \cdot \big(p(S_X \mid R_X, R_W)\, p(R_W)\big).$$

This implies $p(X, W) = p(X)p(W)$ is equal to

$$\Big(\sum_{R_X} p(\{S_W, R_X\}(0_{R_W}))\Big) \cdot \Big(\sum_{R_W} p(S_X(0_{R_X}), R_W)\Big) = p(S_W \mid 0_{R_W}) \cdot \Big(\sum_{R_W} p(S_X \mid 0_{R_X}, R_W)\, p(R_W)\Big).$$

The key to this example is the joint independence (5); independences
of this type arise in hidden variable DAG models. We give an example
later where recoverability is based not on an ordinary independence,
but on a generalized independence, or Verma constraint [23, 20]. In the
following sections, we give a general recursive scheme for solving
recoverability problems under MNAR using these types of constraints.
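The MNAR recovery formula can be checked by exact enumeration on a small discrete model. The sketch below is an illustration under loudly stated assumptions: the graph (independent binary $X$ and $W$, with $X \to R_W$, $W \to R_X$ and $R_W \to R_X$) and all probability tables are hypothetical choices that satisfy an independence of the form (5); they are not taken from Fig. 1 (c).

```python
from itertools import product

MISSING = "missing"

# Hypothetical parameters (assumptions for illustration only).
p_x = {0: 0.7, 1: 0.3}
p_w = {0: 0.4, 1: 0.6}
p_rw1_given_x = {0: 0.2, 1: 0.6}  # P(R_W = 1 | X)
p_rx1_given_w_rw = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.3, (1, 1): 0.8}


def bern(p_one, value):
    """Probability that a binary variable with P(1) = p_one equals value."""
    return p_one if value == 1 else 1.0 - p_one


def manifest():
    """Exact manifest distribution over keys (R_W, R_X, S_W, S_X)."""
    dist = {}
    for x, w, rw, rx in product((0, 1), repeat=4):
        prob = (p_x[x] * p_w[w] * bern(p_rw1_given_x[x], rw)
                * bern(p_rx1_given_w_rw[(w, rw)], rx))
        key = (rw, rx, w if rw == 0 else MISSING, x if rx == 0 else MISSING)
        dist[key] = dist.get(key, 0.0) + prob
    return dist


def total(dist, rw=None, rx=None, sw=None, sx=None):
    """Probability of all outcomes matching the given values (None = any)."""
    wanted = (rw, rx, sw, sx)
    return sum(
        p for key, p in dist.items()
        if all(v is None or v == k for v, k in zip(wanted, key))
    )


dist = manifest()
# p(W) = p(S_W | R_W = 0)
rec_w = {w: total(dist, rw=0, sw=w) / total(dist, rw=0) for w in (0, 1)}
# p(X) = sum over R_W of p(S_X | R_X = 0, R_W) p(R_W)
rec_x = {
    x: sum(
        total(dist, rw=r, rx=0, sx=x) / total(dist, rw=r, rx=0) * total(dist, rw=r)
        for r in (0, 1)
    )
    for x in (0, 1)
}
# Complete-case estimate of p(X), which ignores R_W.
naive_x = {x: total(dist, rx=0, sx=x) / total(dist, rx=0) for x in (0, 1)}
```

Under these assumed parameters the recovered marginals match `p_x` and `p_w`, while the complete-case estimate `naive_x` does not, because $R_X$ depends on $X$ through $R_W$.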

4.2  KNOWN  RESULTS  FOR  MISSINGNESS 
GRAPHS 

Recently [7] and [6] have used missingness graphs to derive conditions
for recoverability when data is MNAR. In particular, the following
characterization appears in [7] (as theorem 2).


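Theorem 3  For any
$p(\mathbb{A}) \in \mathcal{M}(\mathcal{G}^m(\mathbf{V}, \mathbf{M}))$,
if no elements of $\mathbf{R}_{\mathbf{M}}$ are adjacent in
$\mathcal{G}^m(\mathbf{V}, \mathbf{M})$, then
$p(\mathbf{O} \cup \mathbf{M})$ is recoverable from
$p(\mathbf{O}, \mathbf{S}_{\mathbf{M}}, \mathbf{0}_{\mathbf{R}_{\mathbf{M}}})$
if and only if $M \notin \mathrm{pa}_{\mathcal{G}}(R_M)$ for any
$M \in \mathbf{M}$. Moreover, $p(\mathbf{O} \cup \mathbf{M})$ is equal
to

$$\frac{p(\mathbf{O}, \mathbf{S}_{\mathbf{M}}, \mathbf{0}_{\mathbf{R}_{\mathbf{M}}})}{\prod_{R_M \in \mathbf{R}_{\mathbf{M}}} p\big(0_{R_M} \mid \mathrm{pa}_{\mathcal{G}}(R_M) \setminus \mathbf{M}, \mathbf{S}_{\mathrm{pa}_{\mathcal{G}}(R_M) \cap \mathbf{M}}, \mathbf{0}_{\mathbf{R}_{\mathrm{pa}_{\mathcal{G}}(R_M) \cap \mathbf{M}}}\big)}$$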
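In the simplest case the ratio in Theorem 3 has a single factor in the denominator. The sketch below checks it by exact enumeration on a hypothetical MAR-style model with edges $C \to X$ and $C \to R_X$, where the formula reads $p(X, C) = p(C, S_X, R_X = 0) / p(R_X = 0 \mid C)$. The graph and the probability tables are assumptions made for illustration.

```python
from itertools import product

MISSING = "missing"

# Hypothetical parameters (assumptions for illustration only).
p_c = {0: 0.5, 1: 0.5}
p_x1_given_c = {0: 0.2, 1: 0.7}   # P(X = 1 | C)
p_rx1_given_c = {0: 0.1, 1: 0.6}  # P(R_X = 1 | C)


def bern(p_one, value):
    """Probability that a binary variable with P(1) = p_one equals value."""
    return p_one if value == 1 else 1.0 - p_one


def manifest():
    """Exact manifest distribution over keys (C, R_X, S_X)."""
    dist = {}
    for c, x, rx in product((0, 1), repeat=3):
        prob = p_c[c] * bern(p_x1_given_c[c], x) * bern(p_rx1_given_c[c], rx)
        key = (c, rx, x if rx == 0 else MISSING)
        dist[key] = dist.get(key, 0.0) + prob
    return dist


def recover_joint(dist):
    """Apply p(X, C) = p(C, S_X, R_X = 0) / p(R_X = 0 | C)."""
    recovered = {}
    for c, x in product((0, 1), repeat=2):
        numerator = dist.get((c, 0, x), 0.0)
        p_c_marginal = sum(p for (cc, _, _), p in dist.items() if cc == c)
        p_c_and_r0 = sum(p for (cc, rx, _), p in dist.items() if cc == c and rx == 0)
        recovered[(x, c)] = numerator / (p_c_and_r0 / p_c_marginal)
    return recovered


recovered = recover_joint(manifest())
truth = {
    (x, c): p_c[c] * bern(p_x1_given_c[c], x)
    for c, x in product((0, 1), repeat=2)
}
```

Dividing by the conditional probability of being observed is the same reweighting used in inverse probability weighting, here applied to exact probabilities rather than to samples.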

This result can be generalized in three directions. We may consider
cases where variables are unobserved and no missingness mechanism
exists. We may consider recoverability of other queries than
$p(\mathbf{O} \cup \mathbf{M})$, for instance causal effects or
marginal distributions. Finally, we may consider cases where elements
of $\mathbf{R}_{\mathbf{M}}$ are adjacent. This case is important
because it represents important classes of missingness such as
monotonic missingness due to loss to followup. A unit that drops out of
a longitudinal study at time $t$ often remains dropped out at times
$t + 1, \ldots$. In our framework, we would code this by requiring that
for all $t' > t$, $R_{M_{t'}} = 1$ if $R_{M_t} = 1$, where $M_t$ is the
unit's status at time $t$. But this coding is only possible if
indicators are allowed to be adjacent in the graph. In addition,
allowing indicators to be adjacent allows us to model non-monotone
missing data, where a unit may be missing at a particular time $t$, but
then becomes observed at a later time $t + k$.
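The monotonicity requirement on indicators is easy to state as a check on one unit's indicator sequence over time. The function below is a small sketch with a hypothetical name and input format (a list of 0/1 indicator values ordered by time).

```python
def is_monotone(indicators):
    """True if, once some indicator equals 1, all later indicators equal 1."""
    dropped_out = False
    for r in indicators:
        if r not in (0, 1):
            raise ValueError("indicators must be 0 or 1")
        if dropped_out and r == 0:
            # The unit was missing earlier and is observed again.
            return False
        dropped_out = dropped_out or r == 1
    return True
```

A sequence such as `[0, 0, 1, 1]` describes loss to followup, while `[0, 1, 0, 1]` is a non-monotone pattern in which the unit reappears after being missing.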

In this paper, we consider the problem of recovering
$p(\mathbf{O} \cup \mathbf{M})$ given that every missing variable has
an indicator and a proxy (e.g. no completely hidden variables), and
that indicators $\mathbf{R}_{\mathbf{M}}$ are allowed to be adjacent.
We give a recoverability algorithm that generalizes earlier work in
this setting.

5  A  GENERAL  RECOVERABILITY 
ALGORITHM 


The algorithm, which we call MID, works as follows. It tries, for every
$R_M \in \mathbf{R}_{\mathbf{M}}$, to recover

$$p(0_{R_M} \mid \mathrm{pa}_{\mathcal{G}^m}(R_M) \setminus \mathbf{R}_{\mathbf{M}}, \mathbf{0}_{\mathrm{pa}_{\mathcal{G}^m}(R_M) \cap \mathbf{R}_{\mathbf{M}}})$$

via a subroutine DIR. If every such conditional distribution is
recovered, MID recovers $p(\mathbf{O} \cup \mathbf{M})$ via (3),
otherwise MID fails.

The subroutine DIR (so named for its resemblance to the way the ID
algorithm operates when identifying controlled direct effects) has
three cases. The first case, which is sufficient for obtaining the
soundness part of Theorem 3, attempts to check if indicators for
missing parents of $R_M$ are non-parental non-descendants of $R_M$, in
which case recoverability of the conditional distribution for $R_M$ is
immediate.

Otherwise, DIR uses the other two cases to isolate $R_M$ and its
parents into smaller subproblems based on a particular type of
ancestral set $\mathbf{A}^\dagger$, or the clan $\mathbf{D}^\dagger$ of
$R_M$. DIR is recursive, which means the input must also keep track of
a set $\mathbf{W}$ representing variables the clan subproblem ends up
depending on.

The situation is somewhat analogous to the way in which the ID
algorithm attempts to identify controlled direct effects
$p(Y \mid do(\mathbf{v}_{\mathrm{pa}_{\mathcal{G}}(Y)})) = p(Y(\mathbf{v}_{\mathrm{pa}_{\mathcal{G}}(Y)}))$,
with three major differences. First, we are attempting identification
in a setting where some variables start off being treated as hidden,
but in the course of the recursion of DIR become observed due to fixing
indicators to 0. In ID variables are always either hidden or observed
and do not change status. Second, since we are only allowed to
intervene on indicators, we are attempting to identify

$$p\big(R_M(\mathbf{0}_{\mathbf{R}_{\mathbf{M}} \cap \mathrm{pa}_{\mathcal{G}}(R_M)}) \mid \{\mathrm{pa}_{\mathcal{G}}(R_M) \setminus \mathbf{R}_{\mathbf{M}}\}(\mathbf{0}_{\mathbf{R}_{\mathbf{M}} \cap \mathrm{pa}_{\mathcal{G}}(R_M)})\big).$$

Finally, there is not necessarily a fully interventional interpretation
for the intermediate objects $p_{\mathbf{w}}(\cdot)$ that arise during
the execution of DIR, since $\mathbf{W}$ may contain elements outside
$\mathbf{R}_{\mathbf{M}}$. This is a necessary consequence of our
insistence on not imposing a causal model on
$p(\mathbf{M} \cup \mathbf{O})$. Intermediate objects that arise during
the execution of ID can always be interpreted as interventional
distributions.

5.1  SOUNDNESS

MID and its subroutine DIR appear below as Algorithm 1. In this
section, we prove that MID is sound.

Corollary 1 implies that if we were able to express
$p(\{\mathbf{O}, \mathbf{S}_{\mathbf{M}}\}(\mathbf{0}_{\mathbf{R}_{\mathbf{M}}}))$
as a function of the manifest distribution, we would solve the
recoverability problem for $p(\mathbf{O} \cup \mathbf{M})$. If we
happen to know

$$p(0_{R_M} \mid \mathrm{pa}_{\mathcal{G}^m}(R_M) \setminus \mathbf{R}_{\mathbf{M}}, \mathbf{0}_{\mathbf{R}_{\mathbf{M}} \cap \mathrm{pa}_{\mathcal{G}^m}(R_M)})$$

for every $R_M \in \mathbf{R}_{\mathbf{M}}$ as a function of the
manifest, this would suffice due to the following result.


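Lemma 2  Under $\mathcal{M}(\mathcal{G}^m)$, if for every
$R_M \in \mathbf{R}_{\mathbf{M}}$,

$$p(0_{R_M} \mid \mathrm{pa}_{\mathcal{G}^m}(R_M) \setminus \mathbf{R}_{\mathbf{M}}, \mathbf{0}_{\mathrm{pa}_{\mathcal{G}^m}(R_M) \cap \mathbf{R}_{\mathbf{M}}})$$

is a functional $f_{R_M}(\cdot)$ of
$p(\mathbf{O}, \mathbf{S}_{\mathbf{M}}, \mathbf{0}_{\mathbf{R}_{\mathbf{M}}})$,
then

$$p(\{\mathbf{O}, \mathbf{S}_{\mathbf{M}}\}(\mathbf{0}_{\mathbf{R}_{\mathbf{M}}})) = \frac{p(\mathbf{O}, \mathbf{S}_{\mathbf{M}}, \mathbf{0}_{\mathbf{R}_{\mathbf{M}}})}{\prod_{R_M \in \mathbf{R}_{\mathbf{M}}} f_{R_M}(p(\mathbf{O}, \mathbf{S}_{\mathbf{M}}, \mathbf{0}_{\mathbf{R}_{\mathbf{M}}}))}.$$

Proof:
$p(\{\mathbf{O}, \mathbf{S}_{\mathbf{M}}\}(\mathbf{0}_{\mathbf{R}_{\mathbf{M}}})) = \sum_{\mathbf{M}} p(\{\mathbf{A} \setminus \mathbf{R}_{\mathbf{M}}\}(\mathbf{0}_{\mathbf{R}_{\mathbf{M}}}))$.

$$p(\{\mathbf{A} \setminus \mathbf{R}_{\mathbf{M}}\}(\mathbf{0}_{\mathbf{R}_{\mathbf{M}}})) = \frac{p(\mathbf{A} \setminus \mathbf{R}_{\mathbf{M}}, \mathbf{0}_{\mathbf{R}_{\mathbf{M}}})}{\prod_{R_M \in \mathbf{R}_{\mathbf{M}}} f_{R_M}(p(\mathbf{O}, \mathbf{S}_{\mathbf{M}}, \mathbf{0}_{\mathbf{R}_{\mathbf{M}}}))}$$

is implied by (3). But no denominator is a function of $\mathbf{M}$, so
we can apply the sum to the numerator first. □

Finding functionals $f_{R_M}(\cdot)$ for every $R_M$ in order to apply
Lemma 2 is the job of the subroutine DIR.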


Soundness  of  DIR 


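Algorithm 1  $\mathcal{G}^m(\mathbf{V}, \mathbf{M})$ a missingness
graph, $p(\mathbf{V})$ a manifest distribution from
$p(\mathbb{A}) \in \mathcal{M}(\mathcal{G}^m(\mathbf{V}, \mathbf{M}))$,
$p_{\mathbf{w}}(\mathbf{V})$ a family of manifest distributions from
elements of
$p(\mathbb{A}) \in \mathcal{M}(\mathcal{G}^m(\mathbf{V}, \mathbf{M}))$,
$\prec$ a topological order on $\mathcal{G}^m$.

procedure MID($\mathcal{G}^m(\mathbf{V}, \mathbf{M})$, $p(\mathbf{V})$)
    for each $R_M \in \mathbf{R}_{\mathbf{M}}$,
        $p(0_{R_M} \mid \mathrm{pa}_{\mathcal{G}^m}(R_M) \setminus \mathbf{R}_{\mathbf{M}}, \mathbf{0}_{\mathrm{pa}_{\mathcal{G}^m}(R_M) \cap \mathbf{R}_{\mathbf{M}}}) \leftarrow$ DIR($\mathcal{G}^m$, $p$, $R_M$)
    if $(\exists R_M \in \mathbf{R}_{\mathbf{M}})$ s.t. DIR($\mathcal{G}^m$, $p$, $R_M$) $= \emptyset$,
        return "cannot recover."
    else return
        $$\frac{p(\mathbf{O}, \mathbf{S}_{\mathbf{M}}, \mathbf{0}_{\mathbf{R}_{\mathbf{M}}})}{\prod_{R_M \in \mathbf{R}_{\mathbf{M}}} p(0_{R_M} \mid \mathrm{pa}_{\mathcal{G}^m}(R_M) \setminus \mathbf{R}_{\mathbf{M}}, \mathbf{0}_{\mathrm{pa}_{\mathcal{G}^m}(R_M) \cap \mathbf{R}_{\mathbf{M}}})}$$
end procedure

procedure DIR($\mathcal{G}^m(\mathbf{V}, \mathbf{M} \mid \mathbf{W})$, $p_{\mathbf{w}}(\mathbf{V})$, $R_M$)
    if $\mathbf{R}_{\mathbf{M} \cap (\mathrm{pa}_{\mathcal{G}^m}(R_M) \setminus \mathbf{W})} \subseteq \mathrm{ndp}_{\mathcal{G}^m}(R_M)$, return
        $$p_{\mathbf{w}}\big(0_{R_M} \mid \mathrm{pa}_{\mathcal{G}^m}(R_M) \setminus (\mathbf{M} \cup \mathbf{W} \cup \mathbf{R}_{\mathbf{M}}),\ \mathbf{0}_{\mathbf{R}_{\mathbf{M}} \cap \mathrm{pa}_{\mathcal{G}^m}(R_M)},\ \mathbf{S}_{\mathbf{M} \cap \mathrm{pa}_{\mathcal{G}^m}(R_M)},\ \mathbf{0}_{\mathbf{R}_{\mathbf{M} \cap (\mathrm{pa}_{\mathcal{G}^m}(R_M) \setminus \mathbf{W})}}\big)$$
    else $\mathbf{A}^\dagger \leftarrow \{R_M\}$.
    while ($\mathrm{an}_{\mathcal{G}^m}(\mathbf{A}^\dagger) \cup \mathbf{S}_{\mathrm{an}_{\mathcal{G}^m}(\mathbf{A}^\dagger) \cap \mathbf{M}} \supsetneq \mathbf{A}^\dagger$) do
        $\mathbf{A}^\dagger \leftarrow \mathrm{an}_{\mathcal{G}^m}(\mathbf{A}^\dagger) \cup \mathbf{S}_{\mathrm{an}_{\mathcal{G}^m}(\mathbf{A}^\dagger) \cap \mathbf{M}}$.
    if $\mathbf{A}^\dagger \subset \mathbf{A}$,
        return DIR($\mathcal{G}^m_{\mathbf{A}^\dagger}$, $p_{\mathbf{W} \cap \mathbf{A}^\dagger}(\mathbf{V} \cap \mathbf{A}^\dagger)$, $R_M$).
    $\mathbf{D} \leftarrow \mathrm{dis}_{\mathcal{G}^{m(\mathbf{V})}}(R_M)$, $\mathbf{D}^\dagger \leftarrow \mathrm{cla}_{\mathcal{G}^m}(R_M)$.
    if $\mathbf{D} \subset \mathbf{V}$,
        $\mathbf{Z}^\dagger \leftarrow \mathrm{pa}^s_{\mathcal{G}^{m(\mathbf{V})}}(\mathbf{D}) \cap \mathbf{R}_{\mathbf{M}}$
        $\mathbf{Y}^\dagger \leftarrow \mathrm{pa}^s_{\mathcal{G}^{m(\mathbf{V})}}(\mathbf{D}) \setminus \mathbf{R}_{\mathbf{M}}$
        $\mathbf{M}^o_{\mathbf{D}^\dagger} \leftarrow \{M \in (\mathbf{M} \cap \mathbf{D}^\dagger) \mid R_M \in \mathbf{Z}^\dagger\}$
        $\mathbf{M}^h_{\mathbf{D}^\dagger} \leftarrow (\mathbf{M} \cap \mathbf{D}^\dagger) \setminus \mathbf{M}^o_{\mathbf{D}^\dagger}$
        $\mathbf{V}^\dagger \leftarrow \mathbf{D} \cup \mathbf{M}^o_{\mathbf{D}^\dagger}$
        $\mathcal{G}^m \leftarrow \mathcal{G}^m(\mathbf{V}^\dagger, \mathbf{M}^h_{\mathbf{D}^\dagger} \mid \mathbf{Y}^\dagger)$
        $p_{\mathbf{Y}^\dagger}(\mathbf{D}) \leftarrow \prod_{V \in \mathbf{D}} p_{\mathbf{w}}\big(V \mid \mathrm{pre}_{\mathcal{G}^{m(\mathbf{V})},\prec}(V) \setminus \mathbf{Z}^\dagger, \mathbf{0}_{\mathrm{pre}_{\mathcal{G}^{m(\mathbf{V})},\prec}(V) \cap \mathbf{Z}^\dagger}\big)$
        return DIR($\mathcal{G}^m$, $p_{\mathbf{Y}^\dagger}(\mathbf{D})$, $R_M$)
    end if
    return $\emptyset$.
end procedure

The subroutine DIR invoked by MID aims to recover

$$f_{R_M}(p) = p(0_{R_M} \mid \mathrm{pa}_{\mathcal{G}^m}(R_M) \setminus \mathbf{R}_{\mathbf{M}}, \mathbf{0}_{\mathrm{pa}_{\mathcal{G}^m}(R_M) \cap \mathbf{R}_{\mathbf{M}}})$$

by recursively attempting to restrict $R_M$ and
$\mathrm{pa}_{\mathcal{G}^m}(R_M)$ to either an appropriate ancestral
subset containing these vertices, or an appropriate clan of
$\mathcal{G}^m$, and, in the base case, exploiting the independence
structure, and properties of the subproblem that is left.

To prove the soundness of DIR, we must establish, by induction on
algorithm structure, certain results about the subproblems it
considers. We will represent subproblems as a pair consisting of a CDAG
that is a subgraph of the original graph $\mathcal{G}^m$, and a
conditional fragment of the missingness model which can be viewed as a
set of all interventional distributions relevant to the subproblem,
which also possibly depend on variables $\mathbf{W}$ from larger
subproblems.

Given an element $p(\mathbb{A})$ of
$\mathcal{M}(\mathcal{G}^m(\mathbf{A}))$,
$\mathbf{B} \subseteq \mathbf{A}$, and
$\mathbf{W} \subseteq \mathbf{A} \setminus \mathbf{B}$, a conditional
fragment of $p(\mathbb{A})$ with respect to $\mathbf{B}$ and
$\mathbf{W}$, denoted by $\mathcal{F}_{\mathbf{W},\mathbf{B}}$, is a
mapping from elements $\mathbf{w}$ in $\mathfrak{X}_{\mathbf{W}}$ to

$$\mathcal{F}_{\mathbf{w},\mathbf{B}} = \{p_{\mathbf{w}}(\mathbf{B}(\mathbf{r})_{\mathbf{w}}) \mid \mathbf{R} \subseteq \mathbf{R}_{\mathbf{M}} \cap \mathbf{B}, \mathbf{r} \in \mathfrak{X}_{\mathbf{R}}\}.$$

Note that we cannot view
$p_{\mathbf{w}}(\mathbf{B}(\mathbf{r})_{\mathbf{w}})$ as a joint
response of $\mathbf{B}$ to an intervention setting
$\mathbf{R} \cup \mathbf{W}$ to $\mathbf{r} \cup \mathbf{w}$, because
$\mathbf{W}$ may contain elements outside $\mathbf{R}$ that we are not
allowed to intervene on.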
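The `while` loop in the second case of DIR grows a set from $\{R_M\}$ until it is closed both under taking ancestors and under adding the proxy of every missing variable it contains. A minimal sketch of that closure follows. The naming convention `"S_" + M` for proxies, the function name, and both example graphs are assumptions introduced for illustration; only the loop condition is taken from Algorithm 1 as read here.

```python
def ancestral_closure(edges, r_m, missing):
    """Grow {r_m} until it contains all its ancestors and the proxy S_M
    of every missing variable M among those ancestors."""

    def ancestors_of(nodes):
        result, stack = set(nodes), list(nodes)
        while stack:
            current = stack.pop()
            for parent, child in edges:
                if child == current and parent not in result:
                    result.add(parent)
                    stack.append(parent)
        return result

    current = {r_m}
    while True:
        an = ancestors_of(current)
        grown = an | {"S_" + m for m in an if m in missing}
        # `grown` always contains `current`, so inequality means strict growth.
        if grown == current:
            return current
        current = grown


# Hypothetical MAR-style graph: C -> X, C -> R_X, with proxy S_X.
mar_edges = {("C", "X"), ("C", "R_X"), ("X", "S_X"), ("R_X", "S_X")}
mar_closure = ancestral_closure(mar_edges, "R_X", {"X"})

# Hypothetical MNAR-style graph with adjacent indicators.
mnar_edges = {
    ("W", "R_X"), ("W", "S_W"), ("R_W", "S_W"),
    ("X", "R_W"), ("X", "S_X"), ("R_X", "S_X"), ("R_W", "R_X"),
}
mnar_closure = ancestral_closure(mnar_edges, "R_X", {"X", "W"})
```

In the first graph the closure is a proper subset of the vertex set, so the ancestral case would recurse on a smaller problem. In the second graph the closure is the whole vertex set, so a procedure of this shape would have to move on to the clan case.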

For each call to DIR, we want to show that all interventional
distributions in the input fragment are Markov with respect to the
appropriately modified input graph, that we have enough information in
the subproblem to possibly obtain $f_{R_M}(p)$, and that the manifest
distribution of the fragment for the current (inner) call can be
obtained from the manifest distribution of the fragment for the
previous (outer) call.

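Definition 2  $\mathcal{F}_{\mathbf{W},\mathbf{B}}$ is causal Markov
relative to a CDAG $\mathcal{G}^m(\mathbf{B} \mid \mathbf{W})$ if
$(\forall \mathbf{w} \in \mathfrak{X}_{\mathbf{W}}, p_{\mathbf{w}}(\mathbf{B}(\mathbf{r})_{\mathbf{w}}) \in \mathcal{F}_{\mathbf{w},\mathbf{B}})$,
$p_{\mathbf{w}}(\mathbf{B}(\mathbf{r})_{\mathbf{w}})$ is Markov
relative to
$\mathcal{G}^m(\mathbf{B} \mid \mathbf{W})_{\underline{\mathbf{r}}}$.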

This definition is how we will relate fragments and corresponding
subgraphs, and the following two results establish this relationship
for the two recursive cases relevant for DIR.

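Lemma 3  For $\mathcal{F}_{\mathbf{W},\mathbf{A}}$ causal Markov
relative to $\mathcal{G}^m(\mathbf{A} \mid \mathbf{W})$, let
$\mathbf{D} \in \mathcal{D}(\mathcal{G}^{m(\mathbf{V})})$,
$\mathbf{D}^\dagger = \mathrm{cla}_{\mathcal{G}^m}(\mathbf{D})$,
$\mathbf{W}^\dagger = \mathrm{pa}^s_{\mathcal{G}^{m(\mathbf{V})}}(\mathbf{D})$,
$\mathbf{W}^* = \mathbf{W}^\dagger \setminus \mathbf{W}$. Then for any
$\mathbf{w}^\dagger \in \mathfrak{X}_{\mathbf{W}^\dagger}$,
$\mathcal{F}_{\mathbf{w}^\dagger,\mathbf{D}^\dagger} = \{p_{\mathbf{w}^\dagger}(\mathbf{D}^\dagger(\mathbf{r})_{\mathbf{w}^\dagger}) \mid \mathbf{r} \in \mathfrak{X}_{\mathbf{R}}, \mathbf{R} \subseteq \mathbf{R}_{\mathbf{M}} \cap \mathbf{D}^\dagger\}$
is causal Markov relative to
$\mathcal{G}^m_{\mathrm{fa}_{\mathcal{G}^m}(\mathbf{D}^\dagger)}(\mathbf{D}^\dagger \mid \mathbf{W}^\dagger)$,
where for any $\mathbf{w}$ consistent with $\mathbf{w}^\dagger$,
$p_{\mathbf{w}^\dagger}(\mathbf{D}^\dagger(\mathbf{r})_{\mathbf{w}^\dagger})$
is

$$\prod_{V \in \mathbf{D}^\dagger} p_{\mathbf{w}}\big(V \mid (\mathbf{r} \cup \mathbf{w}^\dagger)_{\mathrm{pa}_{\mathcal{G}^m}(V) \cap (\mathbf{R} \cup \mathbf{W}^*)}, \mathrm{pa}_{\mathcal{G}^m}(V) \setminus (\mathbf{R} \cup \mathbf{W}^\dagger)\big).$$

Proof: For any CDAG
$\mathcal{G}(\mathbf{O}, \mathbf{M} \mid \mathbf{W})$,
$\mathrm{fa}_{\mathcal{G}}(\mathrm{cla}_{\mathcal{G}}(\mathbf{D}))$ is
equal to
$\mathrm{cla}_{\mathcal{G}}(\mathbf{D}) \cup \mathrm{pa}^s_{\mathcal{G}^{(\mathbf{O})}}(\mathbf{D})$
for any $\mathbf{D} \in \mathcal{D}(\mathcal{G}^{(\mathbf{O})})$. The
proof is now immediate. Elements
$p_{\mathbf{w}^\dagger}(\mathbf{D}^\dagger(\mathbf{r})_{\mathbf{w}^\dagger})$
of each $\mathcal{F}_{\mathbf{w}^\dagger,\mathbf{D}^\dagger}$ are Markov
relative to
$(\mathcal{G}^m_{\mathrm{fa}_{\mathcal{G}^m}(\mathbf{D}^\dagger)})_{\underline{\mathbf{r}}}$
by construction. The definition of
$p_{\mathbf{w}^\dagger}(\mathbf{D}^\dagger(\mathbf{r})_{\mathbf{w}^\dagger})$
implies it is the same object for any $\mathbf{w}$ consistent with
$\mathbf{w}^\dagger$. □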

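Lemma 4  For $\mathcal{F}_{\mathbf{W},\mathbf{A}}$ causal Markov
relative to $\mathcal{G}^m(\mathbf{A} \mid \mathbf{W})$, let
$\mathbf{V}^\dagger \subseteq \mathbf{A} \cup \mathbf{W}$ be ancestral,
$\mathbf{W}^\dagger = \mathbf{W} \cap \mathbf{V}^\dagger$,
$\mathbf{A}^\dagger = \mathbf{A} \cap \mathbf{V}^\dagger$. Then for any
$\mathbf{w}^\dagger \in \mathfrak{X}_{\mathbf{W}^\dagger}$,
$\mathcal{F}_{\mathbf{w}^\dagger,\mathbf{A}^\dagger} = \{p_{\mathbf{w}^\dagger}(\mathbf{A}^\dagger(\mathbf{r})_{\mathbf{w}^\dagger}) \mid \mathbf{r} \in \mathfrak{X}_{\mathbf{R}}, \mathbf{R} \subseteq \mathbf{R}_{\mathbf{M}} \cap \mathbf{A}^\dagger\}$
is causal Markov relative to
$\mathcal{G}^m_{\mathbf{V}^\dagger}(\mathbf{A}^\dagger \mid \mathbf{W}^\dagger)$,
where for any $\mathbf{w}$ consistent with $\mathbf{w}^\dagger$,
$p_{\mathbf{w}^\dagger}(\mathbf{A}^\dagger(\mathbf{r})_{\mathbf{w}^\dagger})$
is

$$\prod_{V \in \mathbf{A}^\dagger} p_{\mathbf{w}}\big(V \mid \mathbf{r}_{\mathrm{pa}_{\mathcal{G}^m}(V) \cap \mathbf{R}}, \mathrm{pa}_{\mathcal{G}^m}(V) \setminus (\mathbf{R} \cup \mathbf{W}^\dagger)\big).$$

Proof: Immediate. Elements
$p_{\mathbf{w}^\dagger}(\mathbf{A}^\dagger(\mathbf{r})_{\mathbf{w}^\dagger})$
of each $\mathcal{F}_{\mathbf{w}^\dagger,\mathbf{A}^\dagger}$ are Markov
relative to
$(\mathcal{G}^m_{\mathbf{V}^\dagger})_{\underline{\mathbf{r}}}$ by
construction. The definition of
$p_{\mathbf{w}^\dagger}(\mathbf{A}^\dagger(\mathbf{r})_{\mathbf{w}^\dagger})$
implies it is the same object for any $\mathbf{w}$ consistent with
$\mathbf{w}^\dagger$. □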

The next two results re-express
$p(R_M \mid \mathrm{pa}_{\mathcal{G}^m}(R_M))$ from a function of the
larger fragment of the outer recursive call to a function of the
smaller fragment of the inner call.


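Lemma 5 Assume $\mathcal{F}_{\mathbf{w},\mathbf{A}}$ is causal Markov relative to $\mathcal{G}^m(\mathbf{A} \mid \mathbf{W})$, and $\mathcal{F}_{\mathbf{w}^\dagger,\mathbf{D}^\dagger}$ is defined as in Lemma 3. Then for any $R_M \in \mathbf{D}^\dagger$, $p_{\mathbf{w}}(R_M \mid \mathrm{pa}_{\mathcal{G}^m}(R_M) \setminus \mathbf{W})$ is equal to $p_{\mathbf{w}^\dagger}(R_M \mid \mathrm{pa}_{\mathcal{G}^m_{\mathrm{fa}_{\mathcal{G}^m}(\mathbf{D}^\dagger)}}(R_M) \setminus \mathbf{W}^\dagger)$.

Proof: Since $R_M \in \mathbf{D}^\dagger$, this follows by Lemma 3. That is, $p_{\mathbf{w}^\dagger}(R_M \mid \mathrm{pa}_{\mathcal{G}^m_{\mathrm{fa}_{\mathcal{G}^m}(\mathbf{D}^\dagger)}}(R_M) \setminus \mathbf{W}^\dagger)$ is equal to $p_{\mathbf{w}}(R_M \mid \mathbf{w}^\dagger_{\mathrm{pa}_{\mathcal{G}^m}(R_M) \cap (\mathbf{W}^\dagger \setminus \mathbf{W})}, \mathrm{pa}_{\mathcal{G}^m}(R_M) \setminus \mathbf{W}^\dagger)$, which is equal to $p_{\mathbf{w}}(R_M \mid \mathrm{pa}_{\mathcal{G}^m}(R_M) \setminus \mathbf{W})$. □

Lemma 6 Assume $\mathcal{F}_{\mathbf{w},\mathbf{A}}$ is causal Markov relative to $\mathcal{G}^m(\mathbf{A} \mid \mathbf{W})$, and $\mathbf{A}^\dagger$ is defined as in Lemma 4. Then for any $R_M \in \mathbf{A}^\dagger$, $p_{\mathbf{w}}(R_M \mid \mathrm{pa}_{\mathcal{G}^m}(R_M) \setminus \mathbf{W})$ is equal to $p_{\mathbf{w}^\dagger}(R_M \mid \mathrm{pa}_{\mathcal{G}^m_{\mathbf{A}^\dagger}}(R_M) \setminus \mathbf{W}^\dagger)$.

Proof: Since $R_M \in \mathbf{A}^\dagger$, this follows by Lemma 4. That is, $p_{\mathbf{w}^\dagger}(R_M \mid \mathrm{pa}_{\mathcal{G}^m_{\mathbf{A}^\dagger}}(R_M) \setminus \mathbf{W}^\dagger)$ is equal to $p_{\mathbf{w}}(R_M \mid \mathrm{pa}_{\mathcal{G}^m}(R_M) \setminus \mathbf{W})$. □

The next two results express the analogue of the manifest distribution of the smaller fragment as a function of the manifest distribution of the larger fragment. We assume $\mathbf{M}^o_{\mathbf{D}^\dagger}$, $\mathbf{M}^h_{\mathbf{D}^\dagger}$, $\mathbf{V}^\dagger$, $\mathbf{Z}^\dagger$, $\mathbf{Y}^\dagger$, and $\mathcal{G}^m$ are defined as in the district case of DIR. Let $\mathbf{W}^\dagger = \mathbf{Y}^\dagger \cup \mathbf{Z}^\dagger$, and $\mathbf{O}^\dagger = \mathbf{D}^\dagger \cap \mathbf{O}$.

Lemma 7 Assume $\mathcal{F}_{\mathbf{w},\mathbf{A}}$ is causal Markov relative to $\mathcal{G}^m(\mathbf{A} \mid \mathbf{W})$, and $\mathcal{F}_{\mathbf{w}^\dagger,\mathbf{D}^\dagger}$ is defined as in Lemma 3. Then the marginal $p_{\mathbf{y}^\dagger,\mathbf{0}_{\mathbf{Z}^\dagger}}(\mathbf{O}^\dagger, \mathbf{M}^o_{\mathbf{D}^\dagger}, \mathbf{S}_{\mathbf{M}^h_{\mathbf{D}^\dagger}}, \mathbf{R}_{\mathbf{M}^h_{\mathbf{D}^\dagger}})$ of $p_{\mathbf{w}^\dagger}(\mathbf{D}^\dagger) \in \mathcal{F}_{\mathbf{w}^\dagger,\mathbf{D}^\dagger}$ is equal to

$$\prod_{V \in \mathbf{V}^\dagger} p_{\mathbf{w}}\big(V \mid \mathrm{pre}_{\mathcal{G}^m,\prec}(V) \setminus \mathbf{Z}^\dagger,\ \mathbf{0}_{\mathrm{pre}_{\mathcal{G}^m,\prec}(V) \cap \mathbf{Z}^\dagger}\big).$$

Proof: Fix $\mathbf{w}$ and $\mathbf{w}^\dagger$ consistent with $\mathbf{w}$, such that $\mathbf{w}^\dagger_{\mathbf{Z}^\dagger} = \mathbf{0}$. We get the following set of equalities, where the first is by assumption on missingness models, the second by (4), (3) and the definition of $\mathbf{M}^o_{\mathbf{D}^\dagger}$, the third by definition, the fourth by Lemma 3, and the last by standard results on district factorization of hidden variable DAG models found in [22]. If we range over all possible $\mathbf{w}^\dagger_{\mathbf{Y}^\dagger}$, the last expression reduces to $\prod_{V \in \mathbf{V}^\dagger} p_{\mathbf{w}}(V \mid \mathrm{pre}_{\mathcal{G}^m,\prec}(V) \setminus \mathbf{Z}^\dagger, \mathbf{0}_{\mathrm{pre}_{\mathcal{G}^m,\prec}(V) \cap \mathbf{Z}^\dagger})$.

$$\begin{aligned}
p_{\mathbf{w}^\dagger}&(\mathbf{O}^\dagger, \mathbf{M}^o_{\mathbf{D}^\dagger}, \mathbf{S}_{\mathbf{M}^h_{\mathbf{D}^\dagger}}, \mathbf{R}_{\mathbf{M}^h_{\mathbf{D}^\dagger}}) \\
&= p_{\mathbf{w}^\dagger}\big(\mathbf{O}^\dagger, \mathbf{S}_{\mathbf{M}^o_{\mathbf{D}^\dagger}}(\mathbf{R}_{\mathbf{M}^o_{\mathbf{D}^\dagger}} = \mathbf{0}), \mathbf{S}_{\mathbf{M}^h_{\mathbf{D}^\dagger}}, \mathbf{R}_{\mathbf{M}^h_{\mathbf{D}^\dagger}}\big) \\
&= p_{\mathbf{w}^\dagger}\big(\mathbf{O}^\dagger, \mathbf{S}_{\mathbf{M}^o_{\mathbf{D}^\dagger}}, \mathbf{S}_{\mathbf{M}^h_{\mathbf{D}^\dagger}}, \mathbf{R}_{\mathbf{M}^h_{\mathbf{D}^\dagger}}\big) \\
&= \sum_{\mathbf{M}^h_{\mathbf{D}^\dagger}} p_{\mathbf{w}^\dagger}\big(\mathbf{O}^\dagger, \mathbf{S}_{\mathbf{M}^o_{\mathbf{D}^\dagger}}, \mathbf{S}_{\mathbf{M}^h_{\mathbf{D}^\dagger}}, \mathbf{R}_{\mathbf{M}^h_{\mathbf{D}^\dagger}}, \mathbf{M}^h_{\mathbf{D}^\dagger}\big) \\
&= \sum_{\mathbf{M}^h_{\mathbf{D}^\dagger}} \prod_{V \in \mathbf{D}^\dagger} p_{\mathbf{w}}\big(V \mid \mathbf{w}^\dagger_{\mathrm{pa}_{\mathcal{G}^m}(V) \cap (\mathbf{W}^\dagger \setminus \mathbf{W})},\ \mathrm{pa}_{\mathcal{G}^m}(V) \setminus \mathbf{W}^\dagger\big) \\
&= \prod_{V \in \mathbf{V}^\dagger} p_{\mathbf{w}}\big(V \mid \mathrm{pre}_{\mathcal{G}^m,\prec}(V) \setminus \mathbf{W}^\dagger,\ \mathbf{w}^\dagger_{\mathrm{pre}_{\mathcal{G}^m,\prec}(V) \cap \mathbf{W}^\dagger}\big)
\end{aligned}$$

□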


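Lemma 8 Assume $\mathcal{F}_{\mathbf{w},\mathbf{A}}$ is causal Markov relative to $\mathcal{G}^m(\mathbf{A} \mid \mathbf{W})$, and $\mathcal{F}_{\mathbf{w}^\dagger,\mathbf{A}^\dagger}$ is defined as in Lemma 4. Then the element $p_{\mathbf{w}^\dagger}(\mathbf{V} \cap \mathbf{A}^\dagger)$ of $\mathcal{F}_{\mathbf{w}^\dagger,\mathbf{A}^\dagger}$ is equal to $\sum_{\mathbf{V} \setminus \mathbf{A}^\dagger} p_{\mathbf{w}}(\mathbf{V})$.

Proof: $p_{\mathbf{w}^\dagger}(\mathbf{V} \cap \mathbf{A}^\dagger)$ is equal to $\sum_{\mathbf{A}^\dagger \setminus \mathbf{V}} p_{\mathbf{w}^\dagger}(\mathbf{A}^\dagger)$ (by definition), which is equal to $\sum_{\mathbf{A}^\dagger \setminus \mathbf{V}} \sum_{\mathbf{A} \setminus \mathbf{A}^\dagger} p_{\mathbf{w}}(\mathbf{A})$ by Lemma 4. But since both $\mathbf{V}$, $\mathbf{A}^\dagger$ are subsets of $\mathbf{A}$, this is just $\sum_{\mathbf{A} \setminus (\mathbf{V} \cap \mathbf{A}^\dagger)} p_{\mathbf{w}}(\mathbf{A})$, which is equal to $\sum_{\mathbf{V} \setminus \mathbf{A}^\dagger} \sum_{\mathbf{A} \setminus \mathbf{V}} p_{\mathbf{w}}(\mathbf{A}) = \sum_{\mathbf{V} \setminus \mathbf{A}^\dagger} p_{\mathbf{w}}(\mathbf{V})$. □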
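The marginalization bookkeeping in the proof of Lemma 8 can be checked numerically on a small table. The sketch below is only an illustration under stated assumptions: the three binary variables, the weights, and the helper `marginal` are hypothetical, and the code checks only the reordering of sums, not the ancestrality argument of Lemma 4.

```python
from fractions import Fraction
from itertools import product

# Hypothetical toy joint p_w(A) over three binary variables A = (X1, X2, X3).
# The weights are arbitrary; they only need to be positive.
weights = [1, 2, 3, 4, 5, 6, 7, 8]
total = sum(weights)
p_A = {x: Fraction(w, total)
       for x, w in zip(product((0, 1), repeat=3), weights)}


def marginal(joint, keep):
    """Sum a joint table down to the coordinate positions listed in `keep`."""
    out = {}
    for x, pr in joint.items():
        key = tuple(x[i] for i in keep)
        out[key] = out.get(key, Fraction(0)) + pr
    return out


# Take V = {X1, X2} and A_dagger = {X2, X3}, so V ∩ A_dagger = {X2}.
p_V = marginal(p_A, keep=(0, 1))         # sums out A \ V = {X3}
p_A_dagger = marginal(p_A, keep=(1, 2))  # sums out A \ A_dagger = {X1}

# Left side of the lemma: sum p(A_dagger) over A_dagger \ V = {X3}.
lhs = marginal(p_A_dagger, keep=(0,))
# Right side of the lemma: sum p(V) over V \ A_dagger = {X1}.
rhs = marginal(p_V, keep=(1,))

lemma8_bookkeeping_holds = lhs == rhs
```

Both sides are the marginal of X2, so the two tables agree exactly; exact fractions avoid any floating-point tolerance.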

The following result establishes the validity of the base case of DIR, where $p_{\mathbf{w}}(R_M \mid \mathrm{pa}_{\mathcal{G}^m}(R_M) \setminus \mathbf{W})$ is expressed in terms of the manifest distribution for the current fragment.


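Lemma 9 Assume $\mathcal{F}_{\mathbf{w},\mathbf{A}}$ is causal Markov relative to $\mathcal{G}^m(\mathbf{A} \mid \mathbf{W})$. Then if $\mathbf{R}_{\mathbf{M} \cap (\mathrm{pa}_{\mathcal{G}^m}(R_M) \setminus \mathbf{W})} \subseteq \mathrm{nd}_{\mathcal{G}^m}(R_M)$, then

$$p_{\mathbf{w}}\big(0_{R_M} \mid \mathrm{pa}_{\mathcal{G}^m}(R_M) \setminus (\mathbf{W} \cup \mathbf{R}_{\mathbf{M}}),\ \mathbf{0}_{\mathbf{R}_{\mathbf{M}} \cap (\mathrm{pa}_{\mathcal{G}^m}(R_M) \setminus \mathbf{W})}\big)$$

is equal to

$$p_{\mathbf{w}}\big(0_{R_M} \mid \mathrm{pa}_{\mathcal{G}^m}(R_M) \setminus (\mathbf{M} \cup \mathbf{W} \cup \mathbf{R}_{\mathbf{M}}),\ \mathbf{0}_{\mathbf{R}_{\mathbf{M} \cap \mathrm{pa}_{\mathcal{G}^m}(R_M)}},\ \mathbf{S}_{\mathbf{M} \cap \mathrm{pa}_{\mathcal{G}^m}(R_M)},\ \mathbf{0}_{\mathbf{R}_{\mathbf{M}} \cap (\mathrm{pa}_{\mathcal{G}^m}(R_M) \setminus \mathbf{W})}\big).$$

Proof: We get the following set of equalities, where the first follows by assumption, and the fact that $p_{\mathbf{w}}(\mathbf{A})$ is Markov relative to $\mathcal{G}^m$, the second is by the properties of the missingness model, and the third is by (4):

$$\begin{aligned}
p_{\mathbf{w}}&\big(0_{R_M} \mid \mathrm{pa}_{\mathcal{G}^m}(R_M) \setminus (\mathbf{W} \cup \mathbf{R}_{\mathbf{M}}),\ \mathbf{0}_{\mathbf{R}_{\mathbf{M}} \cap (\mathrm{pa}_{\mathcal{G}^m}(R_M) \setminus \mathbf{W})}\big) \\
&= p_{\mathbf{w}}\big(0_{R_M} \mid \mathrm{pa}_{\mathcal{G}^m}(R_M) \setminus (\mathbf{M} \cup \mathbf{W} \cup \mathbf{R}_{\mathbf{M}}),\ \mathrm{pa}_{\mathcal{G}^m}(R_M) \cap \mathbf{M},\ \mathbf{0}_{\mathbf{R}_{\mathbf{M} \cap \mathrm{pa}_{\mathcal{G}^m}(R_M)}},\ \mathbf{0}_{\mathbf{R}_{\mathbf{M}} \cap (\mathrm{pa}_{\mathcal{G}^m}(R_M) \setminus \mathbf{W})}\big) \\
&= p_{\mathbf{w}}\big(0_{R_M} \mid \mathrm{pa}_{\mathcal{G}^m}(R_M) \setminus (\mathbf{M} \cup \mathbf{W} \cup \mathbf{R}_{\mathbf{M}}),\ \mathbf{S}_{\mathbf{M} \cap \mathrm{pa}_{\mathcal{G}^m}(R_M)}(\mathbf{0}_{\mathbf{R}_{\mathbf{M} \cap \mathrm{pa}_{\mathcal{G}^m}(R_M)}}),\ \mathbf{0}_{\mathbf{R}_{\mathbf{M} \cap \mathrm{pa}_{\mathcal{G}^m}(R_M)}},\ \mathbf{0}_{\mathbf{R}_{\mathbf{M}} \cap (\mathrm{pa}_{\mathcal{G}^m}(R_M) \setminus \mathbf{W})}\big) \\
&= p_{\mathbf{w}}\big(0_{R_M} \mid \mathrm{pa}_{\mathcal{G}^m}(R_M) \setminus (\mathbf{M} \cup \mathbf{W} \cup \mathbf{R}_{\mathbf{M}}),\ \mathbf{0}_{\mathbf{R}_{\mathbf{M} \cap \mathrm{pa}_{\mathcal{G}^m}(R_M)}},\ \mathbf{S}_{\mathbf{M} \cap \mathrm{pa}_{\mathcal{G}^m}(R_M)},\ \mathbf{0}_{\mathbf{R}_{\mathbf{M}} \cap (\mathrm{pa}_{\mathcal{G}^m}(R_M) \setminus \mathbf{W})}\big)
\end{aligned}$$

□

Before putting all these results together to show soundness of DIR, we must prove one additional utility lemma that shows the set constructed by DIR is ancestral.

Define an automorphism from vertex sets in $\mathcal{G}^m$, $\rho_{M,\mathcal{G}^m}(\mathbf{B})$, as $\mathrm{an}_{\mathcal{G}^m}(\{R_M\} \cup \mathbf{B}) \cup \mathbf{S}_{\mathrm{an}_{\mathcal{G}^m}(\{R_M\} \cup \mathbf{B})}$. Let $\mathbf{A}^\dagger$ be the fixed point of $\rho_{M,\mathcal{G}^m}$ with the starting input of the empty set.

Lemma 10 $\mathbf{A}^\dagger$ is an ancestral set in $\mathcal{G}^m$.

Proof: A simple proof by contradiction follows by definition of $\rho_{M,\mathcal{G}^m}$. □

We now show the main result of this paper.

Theorem 4 MID is sound.

Proof: Assuming DIR returns the answer for every $R_M \in \mathbf{R}_{\mathbf{M}}$, Corollary 1 and Lemma 2 ensure that MID recovers $p(\mathbf{M} \cup \mathbf{O})$ from $p(\mathbf{V})$.

The soundness of DIR follows by induction on the recursive call structure. The inductive hypothesis is that the input conditional fragment $\mathcal{F}_{\mathbf{w},\mathbf{A}}$ is causal Markov relative to the appropriate graph derived from the input graph $\mathcal{G}^m$, that the input manifest $p_{\mathbf{w}}(\mathbf{V})$ is a function of the original manifest $p(\mathbf{V})$, and that $p_{\mathbf{w}}(R_M \mid \mathrm{pa}_{\mathcal{G}^m}(R_M) \setminus \mathbf{W}) = p(R_M \mid \mathrm{pa}_{\mathcal{G}^m}(R_M))$.

The base case trivially holds for the original inputs to DIR. If the inductive hypothesis is true, and DIR returns after the first conditional, soundness follows by Lemma 9.

If DIR returns after the second conditional, then Lemma 10 ensures the constructed set $\mathbf{A}^\dagger$ is ancestral, and the induction for the following recursive call is maintained via Lemmas 4, 8 and 6.

If DIR returns after the third conditional, Lemma 3 ensures $\mathcal{F}_{\mathbf{w}^\dagger,\mathbf{D}^\dagger}$ is causal Markov relative to $\mathcal{G}^m_{\mathrm{fa}_{\mathcal{G}^m}(\mathbf{D}^\dagger)}(\mathbf{D}^\dagger \mid \mathbf{W}^\dagger)$ for all values of $\mathbf{w}$, including those that set $\mathbf{Z}^\dagger$ to $\mathbf{0}$. Lemma 5 and the inductive hypothesis ensure that $p_{\mathbf{w}^\dagger}(R_M \mid \mathrm{pa}_{\mathcal{G}^m_{\mathrm{fa}_{\mathcal{G}^m}(\mathbf{D}^\dagger)}}(R_M) \setminus \mathbf{W}^\dagger)$ is equal to $p_{\mathbf{w}}(R_M \mid \mathrm{pa}_{\mathcal{G}^m}(R_M) \setminus \mathbf{W})$. Finally, Lemma 7 ensures the manifest for the recursive call is a function of the input manifest. In fact, because we set $\mathbf{Z}^\dagger$ to $\mathbf{0}$, properties of missingness models ensure we can treat $\mathbf{M}^o_{\mathbf{D}^\dagger}$ as observed in subsequent recursive calls, which means we no longer need to consider $\mathbf{S}_{\mathbf{M}^o_{\mathbf{D}^\dagger}}$.

Since induction follows for all cases, so does our conclusion. □

6 A COMPLEX RECOVERABLE EXAMPLE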

We now work through an example where all cases of MID and DIR are necessary. Consider the graph shown in Fig. 2 (a). Here $C$ and $D$ are shown in green to indicate that they are fully observed. This is a more complex version of the example in Fig. 1 (c). Unlike that case, here there are no conditional independences that hold between proxies and indicators. However, if we were to divide by $p(D \mid C)$ and sum out $C$, in the resulting distribution $p_d(S_A, S_B, R_A, R_B, A, B)$, for any fixed value $d$ of $D$, we would have

$$\big(\{S_A(0_{R_A}), R_B\} \perp\!\!\!\perp \{S_B(0_{R_B}), R_A(0_{R_B})\}\big)_{p_d}$$

This is a type of Verma constraint [23] or generalized independence constraint [20].

Our goal is to recover $p(A, B, C, D)$ given the missingness model corresponding to this graph, and in particular the above constraint. We must recover $p(0_{R_B} \mid A, D)$ and $p(0_{R_A} \mid 0_{R_B}, B, D)$ from $p(R_A, R_B, S_A, S_B, C, D)$. In either case, we note that $D$ is not an element of $\mathrm{cla}_{\mathcal{G}^m}(R_A) = \mathrm{cla}_{\mathcal{G}^m}(R_B)$, which implies we can use the clan case of DIR and consider a subproblem shown in Fig. 2 (b), with the corresponding manifest $p_D(R_A, R_B, S_A, S_B, C) = p(S_A, S_B, R_A, R_B \mid D, C)\, p(C)$. In the new subproblem (for either $R_A$ or $R_B$), $C$ is not a part of the ancestral set $\mathbf{A}^\dagger$ constructed by DIR in the ancestral case, so we consider a new subproblem shown in Fig. 2 (c), with the corresponding manifest $p_D(R_A, R_B, S_A, S_B) = \sum_c p_D(S_A, S_B, R_A, R_B \mid D, c)\, p_D(c)$. This new subproblem now resembles the example in Fig. 1 (c), and is solved similarly. In particular, we recover $p(0_{R_B} \mid D, A)$ as

$$\frac{p_d(S_A \mid 0_{R_A}, 0_{R_B})\, p_d(0_{R_B})}{\sum_{R_B} p_d(S_A \mid 0_{R_A}, R_B)\, p_d(R_B)}$$

and $p(0_{R_A} \mid 0_{R_B}, B, D)$ as $p_d(0_{R_A} \mid 0_{R_B}, S_B)$. We then obtain $p(A, B, C, D)$ by dividing the manifest distribution for observed cases $p(0_{R_A}, 0_{R_B}, S_A, S_B, C, D)$ by the above two probabilities.
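The final step, dividing the manifest distribution of fully observed cases by the recovered missingness probabilities, can be illustrated on a much simpler model. The sketch below is a hedged toy, not the graph of Fig. 2: it assumes a hypothetical DAG A -> B, A -> R_B, B -> R_A with made-up parameters, in which each missingness probability happens to be identified by plain conditioning rather than by the recursive cases of DIR.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical toy model (NOT the graph of Fig. 2): A -> B, A -> R_B, B -> R_A.
# R = 0 means "observed", matching the 0_R notation of the text.
p_a = {0: F(3, 5), 1: F(2, 5)}
p_b_given_a = {0: {0: F(7, 10), 1: F(3, 10)}, 1: {0: F(1, 5), 1: F(4, 5)}}
p_ra0_given_b = {0: F(9, 10), 1: F(1, 2)}  # p(R_A = 0 | B)
p_rb0_given_a = {0: F(4, 5), 1: F(3, 10)}  # p(R_B = 0 | A)

# Full joint over (a, b, r_a, r_b); used only to derive the manifest.
full = {}
for a, b, ra, rb in product((0, 1), repeat=4):
    pra = p_ra0_given_b[b] if ra == 0 else 1 - p_ra0_given_b[b]
    prb = p_rb0_given_a[a] if rb == 0 else 1 - p_rb0_given_a[a]
    full[(a, b, ra, rb)] = p_a[a] * p_b_given_a[a][b] * pra * prb

# Manifest: a proxy equals the true value when observed, None when missing.
manifest = {}
for (a, b, ra, rb), pr in full.items():
    key = (a if ra == 0 else None, b if rb == 0 else None, ra, rb)
    manifest[key] = manifest.get(key, F(0)) + pr


def prob(pred):
    """Probability of an event expressed on manifest variables only."""
    return sum(pr for key, pr in manifest.items() if pred(*key))


# p(R_A = 0 | B = b) = p(R_A = 0 | S_B = b, R_B = 0), since R_A ⊥ (A, R_B) | B.
g_a = {b: prob(lambda sa, sb, ra, rb: sb == b and rb == 0 and ra == 0)
          / prob(lambda sa, sb, ra, rb: sb == b and rb == 0)
       for b in (0, 1)}
# p(R_B = 0 | A = a) = p(R_B = 0 | S_A = a, R_A = 0), by the symmetric argument.
g_b = {a: prob(lambda sa, sb, ra, rb: sa == a and ra == 0 and rb == 0)
          / prob(lambda sa, sb, ra, rb: sa == a and ra == 0)
       for a in (0, 1)}

pairs = list(product((0, 1), repeat=2))
# Divide the observed-case manifest by the two missingness probabilities.
recovered = {(a, b): manifest[(a, b, 0, 0)] / (g_a[b] * g_b[a]) for a, b in pairs}
truth = {(a, b): p_a[a] * p_b_given_a[a][b] for a, b in pairs}

# Complete-case analysis, for contrast: renormalize the fully observed rows.
n_obs = sum(manifest[(a, b, 0, 0)] for a, b in pairs)
complete_case = {(a, b): manifest[(a, b, 0, 0)] / n_obs for a, b in pairs}
```

In this toy the division recovers the true joint exactly, while the complete-case estimate does not, even though the model is MNAR. The example in the text needs the additional interventional machinery of DIR because no such plain conditional independences hold there.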


Figure 2: (a) An example where recoverability is possible via MID. (b),(c) Graphs corresponding to subproblems considered by MID in recovering $p(A, B, C, D)$.


7  NONRECOVERABILITY 

The generality of MID naturally raises the question of whether it is complete, that is, whether, whenever it outputs “cannot recover”, it is possible to construct two elements of the missingness model that agree on the manifest but disagree on the underlying joint distribution. We leave this difficult question aside in this paper in the interests of space, but note that an approach similar to one used to show completeness for causal effects identification [18] seems promising. That is, use MID as a guide for constructing a “zoo” of structures where recoverability does not seem to be possible, and then construct a general method for showing non-recoverability for this “zoo.”

Some results on non-recoverability do exist. For example, it can be shown that $p(A)$ is not recoverable in the missingness model with the graph in Fig. 3 (a) [7], and similarly that $p(A, B)$ is not recoverable in the missingness model with the graph in Fig. 3 (b). Characterization of non-recoverability is an open problem.
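The notion of non-recoverability used here, two elements of the missingness model that agree on the manifest but disagree on the target, can be made concrete with a small construction. The sketch below assumes a self-censoring structure A -> R_A with binary A; this may or may not be the exact graph of Fig. 3 (a), and the function name and all numbers are hypothetical.

```python
from fractions import Fraction as F


def manifest(p_a1, p_obs_given_a):
    """Manifest law of (S_A, R_A) for binary A under self-censoring A -> R_A.

    Keys are (a, 0) when A = a is observed and (None, 1) when A is missing.
    """
    p_a = {0: 1 - p_a1, 1: p_a1}
    law = {(a, 0): p_a[a] * p_obs_given_a[a] for a in (0, 1)}
    law[(None, 1)] = 1 - sum(law.values())
    return law


# Two parameterizations of the same graph with different p(A = 1).
p_a1_model_1 = F(1, 2)
p_a1_model_2 = F(1, 4)
model_1 = manifest(p_a1_model_1, {0: F(4, 5), 1: F(2, 5)})
model_2 = manifest(p_a1_model_2, {0: F(8, 15), 1: F(4, 5)})

# The observed-data laws coincide, yet the targets p(A) differ.
same_manifest = model_1 == model_2
different_target = p_a1_model_1 != p_a1_model_2
```

Because only the products p(a) p(0_{R_A} | a) enter the manifest, any rescaling of one factor that is offset by the other is invisible in the observed data, which is why no function of the manifest can return p(A) in this assumed structure.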

8  DISCUSSION  AND  CONCLUSIONS 

We have represented missing data as a type of restricted causal inference problem. Using the machinery of graphical causal models, we have given a general algorithm for recoverability of a joint distribution in MNAR settings. Though we do not require this, our formalism allows the joint distribution we recover to come from a statistical, rather than a causal model - all causal assumptions may be restricted to the missingness model governing the behavior of proxies of missing variables under interventions on indicators. We show that the MCAR, MAR, MNAR taxonomy is not sufficiently granular to classify cases where recoverability is possible. In particular, there are MNAR examples where constraints akin to Verma constraints permit recoverability.


Figure 3: (a) $p(A)$ is not recoverable. (b) $p(A, B)$ is not recoverable. (c) A graph with hidden variables where $p(Y)$ is not recoverable, but $p(Y(a))$ is.


the presence of an unobserved parent), $p(Y)$ is not recoverable. However, if the graph on $C$, $A$, $Y$ represents a causal model, we can show that $p(Y(a))$ is recoverable. In particular,

$$p(Y(a)) = p(S_Y(a, 0_{R_Y})) = \frac{\sum_c p(S_Y, 0_{R_Y} \mid a, c)\, p(c)}{\sum_c p(0_{R_Y} \mid a, c)\, p(c)}$$

A similar observation appears in [6], example 3.


By explicitly representing missingness via an intervenable indicator, and a proxy as a response to this intervention, our formalism allows us to reason explicitly about the interpretation of censoring by death using the existing language of interventions. That is, if $S_X$ is observed patient history, and $1_{R_X}$ implies it is missing due to the patient dying, then we may either disallow considering $S_X(0_{R_X})$ (e.g. “resurrecting the patient”) for that patient, or allow $S_X(0_{R_X})$ but treat it as making statements about exchangeable but different patients who happened to be alive that transfer over to the dead patient in a hypothetical alternative history where the patient never died, and so on.

Note that if we assume a known relationship $p(S_M(r_{R_M}) \mid M)$ between $M$ and $S_M(r_{R_M})$ other than direct equality, we can use the approach in this paper to address certain coarsening [3] and measurement error settings. We do not consider these extensions explicitly here for space reasons, but they are straightforward.